Classification of MALDI-TOF Mass Spectrometry
Data in the Analysis of Cancer Patients


The article presents a case study of the analysis and classification of MALDI-TOF (Matrix-Assisted
Laser Desorption Ionization, Time Of Flight) data. Raw mass spectrometry data are preprocessed and
decomposed with a Gaussian mixture model. A Gaussian mask is calculated and applied to each spectrum
separately. Further dimension reduction uses RFE, PLS and the T test. Classification is performed
with a Support Vector Machine (SVM) with a Gaussian radial basis function kernel.


Introduction 


Classification is an essential part of mass spectrometry data analysis. The most common
classification task is a division into two or more classes, such as ill patients and healthy
donors, positive or negative response to medical treatment, or stages of a disease. Many papers
address these issues for protein sequence and DNA data, microarray expression data and mass
spectrometry data.

Besides the classification itself, dimension reduction and feature selection are very important.
They largely determine the success of the classification because of the specificity of mass spectra
data: high dimensionality combined with a much smaller number of observations can considerably
distort classification results. Classified objects are usually represented by vectors of observed,
measured or calculated features.

Supervised learning classification assumes that there is an unknown function $\Phi$ which assigns
to each object of the population O the label of one class. The classification process is based on
a learning set U, which is a subset of the whole data set O. Each element $o_i$ of the learning
set is composed of the object representation and information about its class label. The object
representation is an observation vector of features. The whole set is divided into c disjoint
subsets, and the observations of each subset belong to exactly one of the c classes.
Supervised learning is widely used in biomedical applications.


A construction of the prediction model 


On the basis of a single learning set, multiple different classifiers can be built. Ideally, the
proper classifier would be chosen on the basis of the probability of misclassifying a new, random
observation. In reality, however, the misclassification probabilities are unknown. They can be
estimated from a validation sample. The validation sample is a random sample, independent of the
learning sample, in which the class membership of the objects is known. The misclassification
probability of a specific classifier is estimated from the proportion of errors made by the
classifier on the validation sample. Classifier evaluation should always use observations
independent of those in the learning sample; otherwise the estimate is biased.

The final classifier evaluation is done with a test sample. It needs to be independent of the
other samples and it must include information about the class membership of the objects. If
only one classifier is to be tested or the data set is small, the validation sample may be
omitted. In practice, a commonly chosen division is 50% for the learning sample and 25% each
for the validation and test samples [1]. However, the division depends on the specificity of
the data set.
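
The 50/25/25 division mentioned above can be sketched in a few lines of Python. This is only an illustration: the function name `split_indices`, the fixed random seed and the rounding rule are assumptions made for the example and are not taken from the study, which used Matlab.

```python
import random


def split_indices(n, fractions=(0.5, 0.25, 0.25), seed=0):
    """Shuffle indices 0..n-1 and cut them into learning, validation and test parts."""
    idx = list(range(n))
    random.Random(seed).shuffle(idx)          # reproducible shuffle
    n_learn = int(round(fractions[0] * n))
    n_valid = int(round(fractions[1] * n))
    learn = idx[:n_learn]
    valid = idx[n_learn:n_learn + n_valid]
    test = idx[n_learn + n_valid:]            # whatever is left goes to the test part
    return learn, valid, test


learn, valid, test = split_indices(100)
print(len(learn), len(valid), len(test))
```

The three index lists are disjoint and together cover every observation, so no observation is used both for building and for evaluating a classifier.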

The classifier decides about class membership on the basis of the learning sample. In practice,
however, the classifier will work on far more data than the learning set contains, so the
probability of a wrong decision is nonzero [2]. Because the classifier is applied to data other
than those it was built from, it must be able to generalize the characteristics of the learning
set. In practice this means learning the properties which are representative of the whole
population while omitting those which are nonessential and characterize only the specific
learning set.

The most popular measures of classification quality are classification accuracy (the proportion
of correctly classified samples) and error rate (the proportion of misclassified samples). Other
important quantities are: TP (True Positives), the number of positive samples classified
correctly; TN (True Negatives), the number of negative samples classified correctly; FP (False
Positives), the number of negative samples incorrectly classified as positive; and FN (False
Negatives), the number of positive samples incorrectly classified as negative.

Other useful measures are sensitivity and specificity. The sensitivity is defined as the ratio
of true positives to the sum of true positives and false negatives (eq. 1). It can be interpreted
as the ability of the classifier to identify the phenomenon where it really exists.

$$ \mathrm{sensitivity} = \frac{TP}{TP + FN} \qquad (1) $$

The specificity is the ratio of true negatives to the sum of true negatives and false positives
(eq. 2). It is interpreted as the ability to reject truly negative cases.

$$ \mathrm{specificity} = \frac{TN}{TN + FP} \qquad (2) $$

For a given classifier, sensitivity and specificity are opposed: moving the decision threshold
to increase one of them decreases the other.
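
The quantities defined above can be computed directly from true and predicted labels. The sketch below is a minimal illustration; the function names and the toy label vectors are made up for the example.

```python
def confusion_counts(y_true, y_pred, positive=1):
    """Count TP, TN, FP and FN for a two-class problem."""
    tp = tn = fp = fn = 0
    for t, p in zip(y_true, y_pred):
        if p == positive and t == positive:
            tp += 1
        elif p == positive:
            fp += 1          # negative sample predicted as positive
        elif t == positive:
            fn += 1          # positive sample predicted as negative
        else:
            tn += 1
    return tp, tn, fp, fn


def sensitivity(tp, fn):
    return tp / (tp + fn)


def specificity(tn, fp):
    return tn / (tn + fp)


def accuracy(tp, tn, fp, fn):
    return (tp + tn) / (tp + tn + fp + fn)


# Toy labels: 4 positive and 6 negative samples.
y_true = [1, 1, 1, 1, 0, 0, 0, 0, 0, 0]
y_pred = [1, 1, 1, 0, 0, 0, 0, 0, 1, 1]
tp, tn, fp, fn = confusion_counts(y_true, y_pred)
print(tp, tn, fp, fn, sensitivity(tp, fn), specificity(tn, fp))
```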

An important tool for characterizing a classifier is the receiver operating characteristic
curve, known as the ROC curve. It plots the sensitivity against 1 - specificity. Such a curve is
created for a specific structure of the classifier (specified type, parameters, number of input
features). The classifier itself remains unchanged; what changes along the curve is the division
of its errors into FP and FN, because the ROC curve examines the proportion between FP and FN.
For a random assignment of objects the ROC curve takes the shape of a straight line going
through the bottom left and upper right corners. The better the classification results, the more
concave the curve is. In the ideal situation the ROC curve goes through the upper left corner of
the chart.
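
A ROC curve can be traced by sorting the samples by classifier score and lowering the decision threshold step by step. The sketch below assumes that each sample has a real-valued score (higher means "more positive"); the function names and the toy data are illustrative only.

```python
def roc_points(y_true, scores):
    """Return (1 - specificity, sensitivity) pairs obtained by lowering the threshold."""
    n_pos = sum(1 for t in y_true if t == 1)
    n_neg = len(y_true) - n_pos
    order = sorted(range(len(scores)), key=lambda i: scores[i], reverse=True)
    points = [(0.0, 0.0)]
    tp = fp = 0
    for k, i in enumerate(order):
        if y_true[i] == 1:
            tp += 1
        else:
            fp += 1
        # Emit a point only after the last sample of a group of tied scores.
        is_last = k == len(order) - 1
        if is_last or scores[order[k + 1]] != scores[i]:
            points.append((fp / n_neg, tp / n_pos))
    return points


def auc(points):
    """Area under the ROC curve by the trapezoidal rule."""
    area = 0.0
    for (x0, y0), (x1, y1) in zip(points, points[1:]):
        area += (x1 - x0) * (y0 + y1) / 2.0
    return area


y_true = [1, 1, 0, 1, 0, 0]
scores = [0.9, 0.8, 0.7, 0.6, 0.4, 0.2]
pts = roc_points(y_true, scores)
print(pts[0], pts[-1], auc(pts))
```

An area of 0.5 corresponds to the diagonal of a random assignment and an area of 1 to a curve passing through the upper left corner.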


SVM classifier 


The Support Vector Machine (SVM) is a relatively young but widely used classifier. It was
proposed by V. N. Vapnik [3-5]. The idea of the method is to classify with an appropriately
chosen discriminant hyperplane. Finding such a hyperplane relies on Mercer's theorem and on the
optimization of a quadratic objective function with linear constraints.

If the learning subsets are fully separable, the SVM idea is to find two parallel hyperplanes
which delimit the widest area that does not contain any sample elements. To satisfy this
condition the hyperplanes need to rest on some of the sample elements. Such elements are called
support vectors. The discriminant hyperplane is placed in the middle of the resulting area. If
the learning subsets are not linearly separable, a penalty is introduced. A better separation
can be obtained in a higher-dimensional space.

The SVM rule takes the form of eq. 3:

$$ f(x) = \operatorname{sgn}\left( \sum_{i} \alpha_i y_i \, x_i^{T} x + b \right) \qquad (3) $$

where $x_i$ are the learning observations with class labels $y_i$, $\alpha_i$ are the Lagrange
coefficients and $b$ is a constant. For inseparable classes the additional restrictions take the
form of eq. 4, where $\xi_i \ge 0$.
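
For orientation, textbook treatments of the soft-margin SVM usually write the penalized problem and its restrictions as below. The weight vector $w$, the penalty constant $C$ and this exact form are assumptions taken from the standard formulation, not from this article.

```latex
\min_{w,\, b,\, \xi} \; \frac{1}{2} \|w\|^{2} + C \sum_{i=1}^{n} \xi_i
\quad \text{subject to} \quad
y_i \left( w^{T} x_i + b \right) \ge 1 - \xi_i, \qquad \xi_i \ge 0, \qquad i = 1, \ldots, n
```

In this form a larger $C$ penalizes margin violations more strongly.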


More complicated classification problems are solved with the use of kernel functions. This
construction makes it possible to obtain non-linear decision boundaries. The SVM rule with a
kernel takes the form of eq. 5:

$$ f(x) = \operatorname{sgn}\left( \sum_{i} \alpha_i y_i \, K(x_i, x) + b \right) \qquad (5) $$

where $K(x, x')$ is a kernel. One of the most popular kernel functions is the radial kernel
(eq. 6), with a scaling constant $c > 0$:

$$ K(x, x') = \exp\left( -\frac{\|x - x'\|^{2}}{c} \right) \qquad (6) $$
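
A minimal sketch of the radial kernel and of a kernel decision rule follows. The width parameter is called `c` here as an assumption, and the support vectors, coefficients and bias in the example are hand-picked for illustration rather than obtained by training.

```python
import math


def rbf_kernel(x, z, c=1.0):
    """Radial kernel: exp(-||x - z||^2 / c), with c > 0 controlling the width."""
    sq_dist = sum((a - b) ** 2 for a, b in zip(x, z))
    return math.exp(-sq_dist / c)


def svm_decision(x, support_vectors, labels, alphas, b, c=1.0):
    """Sign of the kernel expansion sum_i alpha_i * y_i * K(x_i, x) + b."""
    s = sum(a * y * rbf_kernel(sv, x, c)
            for sv, y, a in zip(support_vectors, labels, alphas))
    return 1 if s + b >= 0 else -1


# Hand-picked (not trained) values, only to show how the rule is evaluated.
support_vectors = [[0.0, 0.0], [2.0, 2.0]]
labels = [1, -1]
alphas = [1.0, 1.0]
print(svm_decision([0.1, 0.1], support_vectors, labels, alphas, b=0.0))
print(svm_decision([1.9, 2.0], support_vectors, labels, alphas, b=0.0))
```

A point is assigned to the class of the support vectors that dominate the weighted kernel sum, which is what produces non-linear decision boundaries in the input space.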


Dimension reduction techniques 


Input data sets for classification usually contain several hundred or even thousands of
features. From the statistical point of view, using such a number of features is unreasonable.
Many reduction and selection techniques are available. They attempt to find the smallest subset
of the whole data set that satisfies defined criteria. Too large a number of features has an
adverse impact on the classification results. This applies especially to biological data such
as mass spectrometry and microarray data. A large number of features increases computational
complexity and lengthens calculation time. Moreover, it lowers the quality of classification
because the features are correlated, which makes the analysis difficult and the discrimination
between classes hard to obtain [2]. A large number of features also implies a large number of
classifier parameters, which increases the complexity of the classifier and its susceptibility
to overfitting and decreases its flexibility. The curse of dimensionality [6] shows that the
complexity of the classifier affects the classification quality: the more complex the
classifier, the higher the ratio of the number of observations to the number of features
should be [7].

There are two types of methods:


1) feature extraction: the data are transformed and a new data set is obtained;


2) feature selection: a subset of the most suitable features is chosen.

One of the commonly known feature extraction methods is Partial Least Squares (PLS) [7].
It can also be used for classification. Dimension reduction in the PLS method uses both the X
and Y data, so it exploits the structure of the whole learning data set.

The idea of the PLS method is to find latent vectors. Latent vectors enable simultaneous
analysis and decomposition of X and Y which takes the covariance between X and Y into account.
This approach makes PLS closely related to Principal Component Analysis (PCA) [5].
The decomposition of X and Y is done into a low-dimensional space of hidden variables.
The independent variables X are decomposed according to eq. 7:

$$ X = T P^{T} + E \qquad (7) $$

where $T^{T} T = I$, $I$ is the identity matrix, $T$ is the score matrix and $P$ is the loading
matrix. The product of $T$ and $P^{T}$ gives a good estimate of the X matrix.
The dependent variables Y are decomposed as in eq. 8:

$$ Y = U Q^{T} + E_{1} \qquad (8) $$

The final PLS model describing the regression of Y on X is given by eq. 9:

$$ Y = X \left( P B_{1} Q^{T} \right) + E = X B + E \qquad (9) $$


SVM-RFE (Support Vector Machine Recursive Feature Elimination) [8] is a feature selection
method. The selection is performed by backward elimination. The procedure starts with the full
set of input features; the features are ranked and successively removed, one feature at a time.
The SVM weight coefficients are used as the ranking criterion. Therefore the SVM-RFE method is
closely related to SVM classification.

In the SVM-RFE procedure, SVM classification may be formulated as in eq. 10.
Eq. 10 is solved with eq. 11:

$$ \min_{\alpha} \; \frac{1}{2} \sum_{i} \sum_{j} \alpha_i \alpha_j y_i y_j K(x_i, x_j) - \sum_{i} \alpha_i \qquad (11) $$

where $K(x_i, x_j)$ is a kernel function.
The SVM-RFE objective function is given by eq. 12.
Changes in the objective function caused by feature elimination may be written using the Taylor
series (eq. 13). At the optimal point $\Delta J(i)$ takes the value $\Delta J(i) = (\Delta w_i)^{2}$,
where $w_i$ is the weight of the $i$-th removed feature.
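
The backward elimination loop can be sketched as follows. This is not the implementation used in the article: as a stand-in, the example trains a linear SVM without a bias term by plain full-batch subgradient descent on the regularized hinge loss, with labels in {-1, +1}. The function names, the learning rate `lr`, the regularization constant `lam` and the toy data are all assumptions made for illustration.

```python
def train_linear_svm(X, y, lam=0.01, lr=0.1, n_iter=200):
    """Full-batch subgradient descent on the regularized hinge loss (no bias term)."""
    d = len(X[0])
    n = len(X)
    w = [0.0] * d
    for _ in range(n_iter):
        grad = [lam * wj for wj in w]
        for xi, yi in zip(X, y):
            margin = yi * sum(wj * xj for wj, xj in zip(w, xi))
            if margin < 1.0:                      # sample violates the margin
                for j in range(d):
                    grad[j] -= yi * xi[j] / n
        w = [wj - lr * gj for wj, gj in zip(w, grad)]
    return w


def svm_rfe(X, y, n_keep):
    """Backward elimination: retrain, drop the feature with the smallest w_i**2, repeat."""
    remaining = list(range(len(X[0])))
    eliminated = []
    while len(remaining) > n_keep:
        X_sub = [[row[j] for j in remaining] for row in X]
        w = train_linear_svm(X_sub, y)
        worst = min(range(len(remaining)), key=lambda k: w[k] ** 2)
        eliminated.append(remaining.pop(worst))   # one feature per round
    return remaining, eliminated


# Toy data: feature 0 separates the classes, feature 1 is weak, feature 2 is constant.
X = [[2.0, 0.1, 0.0], [2.0, -0.1, 0.0], [2.0, 0.2, 0.0],
     [-2.0, 0.1, 0.0], [-2.0, -0.1, 0.0], [-2.0, 0.1, 0.0]]
y = [1, 1, 1, -1, -1, -1]
kept, eliminated = svm_rfe(X, y, n_keep=1)
print(kept, eliminated)
```

Retraining after every removal is what distinguishes recursive elimination from a single ranking pass: the weights of the remaining features are re-estimated once a feature has been dropped.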
A very common feature selection technique is the T test. The most significant features
according to the T test are chosen. For each feature a T test score is calculated with eq. 14:

$$ T(x_i) = \frac{\left| \mu_i^{+} - \mu_i^{-} \right|}{\sqrt{\dfrac{(\sigma_i^{+})^{2}}{n^{+}} + \dfrac{(\sigma_i^{-})^{2}}{n^{-}}}} \qquad (14) $$

where $\mu_i^{+}$ and $\mu_i^{-}$ denote the mean values of the $i$-th feature calculated for the
positive and negative samples, respectively. Similarly, $\sigma_i^{+}$ and $\sigma_i^{-}$ denote
the standard deviations, and $n^{+}$ and $n^{-}$ denote the numbers of positive and negative
learning samples.
The T statistic treats all features as independent. This assumption is usually not met.
Nevertheless, the T test is successfully used for protein data classification.
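
Ranking features by such a T score takes only a few lines. The sketch below assumes the unequal-variance form of the statistic with sample standard deviations; the function names and the toy matrix are illustrative, and a feature that is constant in both classes would need separate handling to avoid division by zero.

```python
import math


def mean(values):
    return sum(values) / len(values)


def sample_std(values):
    m = mean(values)
    return math.sqrt(sum((v - m) ** 2 for v in values) / (len(values) - 1))


def t_score(pos, neg):
    """Absolute difference of class means divided by its standard error."""
    se = math.sqrt(sample_std(pos) ** 2 / len(pos) + sample_std(neg) ** 2 / len(neg))
    return abs(mean(pos) - mean(neg)) / se


def rank_features(X, y, k):
    """Return the indices of the k features with the highest T score, and all scores."""
    n_features = len(X[0])
    scores = []
    for j in range(n_features):
        pos = [row[j] for row, label in zip(X, y) if label == 1]
        neg = [row[j] for row, label in zip(X, y) if label != 1]
        scores.append(t_score(pos, neg))
    order = sorted(range(n_features), key=lambda j: scores[j], reverse=True)
    return order[:k], scores


# Toy data: only feature 0 differs between the classes.
X = [[5.0, 1.0, 0.2], [5.2, 0.9, 0.1], [4.8, 1.1, 0.3],
     [1.0, 1.0, 0.2], [1.2, 1.2, 0.1], [0.8, 0.8, 0.3]]
y = [1, 1, 1, 0, 0, 0]
selected, scores = rank_features(X, y, k=1)
print(selected, scores)
```

Each feature is scored on its own, which is exactly the independence assumption noted above.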

Characteristics of the data set

The data set presented in the paper consists of MALDI-TOF (Matrix-Assisted Laser Desorption
Ionization, Time Of Flight) mass spectra. The classification and dimension reduction methods
were applied to this data set. It contains serum albumin spectra obtained in a study of head
cancer patients and healthy donors. The data set contains 100 data files, each of them confirmed
by four repetitions: a sample was taken from each person twice and each of those two samples
was analyzed twice. The aim of the analysis is to detect peaks and find their biological
interpretation. The classification presented in this paper is a very important part of this
analysis.

Each file of the data set contains 45 thousand points. A typical mass spectrum is composed of
two data vectors: M/Z values (X axis) and intensities (Y axis). Before the analysis, the spectra
must be preprocessed. The preprocessing steps involve binning, interpolation, normalization,
baseline correction, denoising, peak detection [9] and alignment [10]. One of the most important
preprocessing steps is denoising, especially baseline correction. The baseline is a special case
of noise which is strongest in the initial part of the spectrum, where the M/Z values are low.
Removal of this kind of noise flattens and averages the spectrum. Baseline correction is
essential for further analysis and improves its quality. It is usually performed with multiple
shifted windows of a defined width. Normalization and interpolation are useful when several
spectra are analyzed and compared simultaneously. Interpolation unifies the measurement points
[11] along the M/Z axis of all spectra. Normalization [12], [13] scales all spectra to a single
value of the area under the curve. The scaling is usually done with respect to the total ion
current (TIC) value or to the constant noise. An example of an analyzed spectrum with the result
of baseline correction is presented in Fig. 1.


Figure 1 - An example of a mass spectrum (legend: original spectrum, baseline, baseline points)
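
Three of the preprocessing steps can be illustrated with a simplified sketch. It assumes linear interpolation onto a common M/Z grid, a baseline estimated as the minimum in a sliding window (a simplification of the shifted-window approach) and normalization by the sum of intensities as a stand-in for TIC. All function names and the toy spectrum are hypothetical.

```python
from bisect import bisect_left


def interpolate(mz, intensities, new_mz):
    """Linear interpolation of a spectrum onto a common grid (mz must be increasing)."""
    out = []
    for x in new_mz:
        if x <= mz[0]:
            out.append(intensities[0])
        elif x >= mz[-1]:
            out.append(intensities[-1])
        else:
            j = bisect_left(mz, x)               # mz[j - 1] < x <= mz[j]
            x0, x1 = mz[j - 1], mz[j]
            y0, y1 = intensities[j - 1], intensities[j]
            out.append(y0 + (y1 - y0) * (x - x0) / (x1 - x0))
    return out


def baseline_window_min(intensities, half_width):
    """Baseline estimate: minimum intensity in a window centred on each point."""
    n = len(intensities)
    return [min(intensities[max(0, i - half_width):min(n, i + half_width + 1)])
            for i in range(n)]


def subtract_baseline(intensities, half_width):
    base = baseline_window_min(intensities, half_width)
    return [y - b for y, b in zip(intensities, base)]


def tic_normalize(intensities):
    """Scale the spectrum so that its intensities sum to one."""
    total = sum(intensities)
    return [y / total for y in intensities]


mz = [1000.0, 1001.0, 1002.0, 1003.0, 1004.0]
raw = [12.0, 11.0, 30.0, 10.0, 9.0]
corrected = subtract_baseline(raw, half_width=1)
normalized = tic_normalize(corrected)
resampled = interpolate(mz, raw, [1001.5])
print(corrected, normalized, resampled)
```

In practice the window width has to be chosen much wider than a single peak, otherwise the baseline estimate follows the peaks themselves.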


The method used for mass spectra analysis is based on Gaussian mixture decomposition. The data
are modeled with Gaussian mixture models (GMM). The fitting is done with the
Expectation-Maximization (EM) algorithm, which maximizes the likelihood function. The analysis
may be performed for several spectra simultaneously by using the mean spectrum, which makes the
calculations faster and more efficient.

This method enables a preliminary dimension reduction. After preprocessing, the four repeated
spectra of each person were averaged and the mean spectrum was calculated. Next, it was modeled
with a GMM with a defined number of components (300).
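A mixture model is a combination of a finite number of probability distributions (eq. 15):

$$ f^{mix}(x, \alpha_1, \ldots, \alpha_K, p_1, \ldots, p_K) = \sum_{k=1}^{K} \alpha_k f_k(x, p_k) \qquad (15) $$

where $K$ is the number of components in the mixture, $p_k$ are the parameters of the $k$-th
component and $\alpha_k$, $k = 1, 2, \ldots, K$, are the weights of the particular components,
$\sum_{k=1}^{K} \alpha_k = 1$. The Gaussian distribution (eq. 16) is given by two parameters:
the mean $\mu_k$ and the standard deviation $\sigma_k$. Distributions in the mixture are also
specified by additional parameters, the weights, which determine their contribution to the
whole mixture.

$$ f_k(x, \mu_k, \sigma_k) = \frac{1}{\sigma_k \sqrt{2\pi}} \exp\left( -\frac{(x - \mu_k)^{2}}{2 \sigma_k^{2}} \right) \qquad (16) $$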
The Expectation-Maximization algorithm is a nonlinear method composed of two main steps
performed in a loop. The expectation step (E) consists in calculating the distribution of the
hidden variables (eq. 17).
The maximization step (M) calculates new values of the mixture parameters. In the case of GMM
the M step is given by eq. 18.
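
The two steps can be illustrated with the standard textbook EM updates for a one-dimensional Gaussian mixture. The sketch assumes plain sample points, optionally weighted (for a spectrum the intensities could serve as weights); the function names, the small floor on the standard deviation and the toy data are assumptions for the example, not details taken from the article.

```python
import math


def gauss(x, mu, sigma):
    return math.exp(-(x - mu) ** 2 / (2.0 * sigma ** 2)) / (sigma * math.sqrt(2.0 * math.pi))


def em_gmm_1d(xs, mus, sigmas, alphas, n_iter=50, weights=None):
    """EM for a 1-D Gaussian mixture; returns updated means, deviations and weights."""
    K = len(mus)
    mus, sigmas, alphas = list(mus), list(sigmas), list(alphas)
    w = weights if weights is not None else [1.0] * len(xs)
    total_w = sum(w)
    for _ in range(n_iter):
        # E step: posterior probability of each component for each point.
        resp = []
        for x in xs:
            dens = [alphas[k] * gauss(x, mus[k], sigmas[k]) for k in range(K)]
            s = sum(dens)
            resp.append([d / s for d in dens])
        # M step: re-estimate the parameters from the weighted responsibilities.
        for k in range(K):
            nk = sum(wn * r[k] for wn, r in zip(w, resp))
            mus[k] = sum(wn * r[k] * x for wn, r, x in zip(w, resp, xs)) / nk
            var = sum(wn * r[k] * (x - mus[k]) ** 2 for wn, r, x in zip(w, resp, xs)) / nk
            sigmas[k] = max(math.sqrt(var), 1e-6)   # floor avoids a collapsed component
            alphas[k] = nk / total_w
    return mus, sigmas, alphas


# Toy data: two well separated groups of points.
xs = [0.8, 0.9, 1.0, 1.1, 1.2, 4.8, 4.9, 5.0, 5.1, 5.2]
mus, sigmas, alphas = em_gmm_1d(xs, mus=[0.0, 6.0], sigmas=[1.0, 1.0], alphas=[0.5, 0.5])
print(mus, sigmas, alphas)
```

The result depends on the initial parameter values, so for a mixture with hundreds of components the initialization deserves particular care.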
The decomposition results are used as a Gaussian mask which is applied to every single spectrum
in the data set. This gives new values representing the spectra. The dimension of the
spectrometry data decreases to the number of GMM components. The resulting matrix obtained
after those steps has size $n \times k$, where $n$ denotes the number of spectra and $k$ the
number of components.

The resulting matrix was the input data for the further dimension reduction and classification.
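
The text does not spell out how the mask is applied, so the following is only one plausible reading: each intensity of a spectrum is shared among the mixture components in proportion to their posterior probabilities at that M/Z value, and the shares are summed per component. The function names and the numbers are hypothetical.

```python
import math


def gauss(x, mu, sigma):
    return math.exp(-(x - mu) ** 2 / (2.0 * sigma ** 2)) / (sigma * math.sqrt(2.0 * math.pi))


def apply_gaussian_mask(mz, intensities, mus, sigmas, alphas):
    """Reduce one spectrum to one value per mixture component (assumed soft assignment)."""
    K = len(mus)
    features = [0.0] * K
    for x, y in zip(mz, intensities):
        dens = [alphas[k] * gauss(x, mus[k], sigmas[k]) for k in range(K)]
        s = sum(dens)
        if s == 0.0:
            continue                      # point not covered by any component
        for k in range(K):
            features[k] += y * dens[k] / s
    return features


# Toy spectrum with two groups of points and a two-component mask.
mz = [1.0, 2.0, 9.0, 10.0]
intensities = [3.0, 1.0, 2.0, 4.0]
features = apply_gaussian_mask(mz, intensities, mus=[1.5, 9.5], sigmas=[0.5, 0.5],
                               alphas=[0.5, 0.5])
print(features)
```

Applying the same mask to every spectrum and stacking the resulting vectors row by row gives a matrix with one row per spectrum and one column per component.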


Results 


The classification analysis was performed using all three presented dimension reduction
techniques. The classification was done with an SVM classifier with a radial kernel. Tests of
classification and reduction performance were done for different values of the SVM parameters
and of the number of selected features. To find the most accurate values, the division of the
data set into testing and learning subsets and the classification calculations were repeated
several hundred times. All calculations were done in the Matlab environment. The SVM parameters
are the value of the box constraint (C) for the soft margin and the scaling factor (sigma).
Results of multiple repetitions of SVM for different sigma values are presented in Fig. 2.
According to the results, the sigma value was set to 12 and C to 4000.

Figure 2 - Searching for the optimal sigma value
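
The repeated random division into learning and testing subsets can be sketched as a small evaluation harness. To keep the example self-contained, a nearest-centroid rule is used as a stand-in classifier instead of an SVM, and the confidence interval uses a normal approximation; both choices, the function names and the toy data are assumptions, not the Matlab procedure used in the study.

```python
import math
import random


def fit_centroids(X, y):
    """Stand-in classifier: one mean vector per class present in the learning subset."""
    sums, counts = {}, {}
    for row, label in zip(X, y):
        acc = sums.setdefault(label, [0.0] * len(row))
        for j, v in enumerate(row):
            acc[j] += v
        counts[label] = counts.get(label, 0) + 1
    return {label: [s / counts[label] for s in acc] for label, acc in sums.items()}


def predict_centroid(model, row):
    def sq_dist(c):
        return sum((a - b) ** 2 for a, b in zip(row, c))
    return min(model, key=lambda label: sq_dist(model[label]))


def repeated_holdout(X, y, fit, predict, n_repeats=200, test_fraction=0.5, seed=0):
    """Mean test accuracy over random splits, with an approximate 95% interval."""
    rng = random.Random(seed)
    n = len(X)
    n_test = int(round(test_fraction * n))
    accs = []
    for _ in range(n_repeats):
        idx = list(range(n))
        rng.shuffle(idx)
        test_idx, learn_idx = idx[:n_test], idx[n_test:]
        model = fit([X[i] for i in learn_idx], [y[i] for i in learn_idx])
        hits = sum(1 for i in test_idx if predict(model, X[i]) == y[i])
        accs.append(hits / n_test)
    m = sum(accs) / len(accs)
    sd = math.sqrt(sum((a - m) ** 2 for a in accs) / (len(accs) - 1))
    half = 1.96 * sd / math.sqrt(len(accs))
    return m, (m - half, m + half)


# Toy data: two well separated classes of ten points each.
X = [[1.0 + 0.1 * i, 1.0] for i in range(10)] + [[5.0 + 0.1 * i, 1.0] for i in range(10)]
y = [1] * 10 + [0] * 10
mean_acc, (lo, hi) = repeated_holdout(X, y, fit_centroids, predict_centroid)
print(mean_acc, lo, hi)
```

To search for parameter values, the same harness would be run once per candidate setting of the classifier and the settings compared by their mean accuracy.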


Once the parameters are known, the optimal number of features needs to be found. For a learning
data set of 50 elements, the number of features should not be larger than 10. The results for
all three types of dimension reduction techniques are presented in Fig. 3. The middle line is
the obtained ratio and the upper and lower lines denote the confidence interval. Similar results
are obtained for the FN and FP values. Fig. 4 presents a typical ROC curve determined for
SVM-RFE with six features.

Figure 3 - Results of classification after dimension reduction

Figure 4 - The ROC curve


Conclusion


Proteomic data, especially mass spectrometry data, need special processing and analysis. The
specificity of the data makes them hard to classify. Special dimension reduction techniques
need to be used. The most common technique, the T test, gives good results, but the weakest of
the three. The most reliable is the SVM-RFE technique. It is closely connected with the SVM
classification method, which makes it suitable for MS data. The SVM classifier and its variants
are nowadays among the most popular classification techniques in proteomic research.

Małgorzata Plechawska

Classification of MALDI-TOF Mass Spectrometry Data in Studies of Cancer Patients

The paper proposes a classification of MALDI-TOF mass spectrometry data from medical studies,
together with algorithms and computer modelling programs used in problems of the diagnosis and
treatment of cancer.
